From N-Grams to Collocations: An Evaluation of Xtract
Abstract
In previous papers we presented methods for retrieving collocations from large samples of text. We described a tool, Xtract, that implements these methods and is able to retrieve a wide range of collocations in a two-stage process. These methods, as well as other related methods, have some limitations, however. Mainly, the collocations produced do not include any kind of functional information, and many of them are invalid. In this paper we introduce methods that address these issues. These methods are implemented in an added third stage of Xtract that examines the set of collocations retrieved during the previous two stages, both to filter out a number of invalid collocations and to add useful syntactic information to the retained ones. By combining parsing and statistical techniques, the addition of this third stage has raised the overall precision level of Xtract from 40% to 80%, with a recall of 94%. In the paper we describe the methods and the evaluation experiments.
1 INTRODUCTION
In the past, several approaches have been proposed to retrieve various types of collocations from the analysis of large samples of textual data. Pairwise associations (bigrams or 2-grams) (e.g., [Smadja, 1988], [Church and Hanks, 1989]) as well as n-word (n > 2) associations (or n-grams) (e.g., [Choueka et al., 1983], [Smadja and McKeown, 1990]) were retrieved. These techniques automatically produced large numbers of collocations along with statistical figures intended to reflect their relevance. However, none of these techniques provides functional information along with the collocation. Also, the results produced often contained improper word associations that reflected some spurious aspect of the training corpus and did not stand for true collocations. This paper addresses these two problems.
Previous papers (e.g., [Smadja and McKeown, 1990]) introduced a set of techniques and a tool, Xtract, that produces various types of collocations from a two-stage statistical analysis of large textual corpora, briefly sketched in the next section. In Sections 3 and 4, we show how robust parsing technology can be used both to filter out a number of invalid collocations and to add useful syntactic information to the retained ones. This filter/analyzer is implemented in a third stage of Xtract that automatically goes over the output collocations to reject the invalid ones and label the valid ones with syntactic information. For example, if the first two stages of Xtract produce the collocation "make-decision," the goal of this third stage is to identify it as a verb-object collocation. If no such syntactic relation is observed, then the collocation is rejected. In Section 5 we present an evaluation of Xtract as a collocation retrieval system. In that evaluation, the addition of the third stage raised the precision of Xtract from 40% to 80%, with a recall of 94%. In this paper we use examples related to the word "takeover" from a 10 million word corpus containing stock market reports originating from the Associated Press newswire.
2 FIRST 2 STAGES OF XTRACT, PRODUCING N-GRAMS
In a first stage, Xtract uses statistical techniques to retrieve pairs of words (or bigrams) whose common appearances within a single sentence are correlated in the corpus. A bigram is retrieved if its frequency of occurrence is above a certain threshold and if the words are used in relatively rigid ways. Some bigrams produced by the first stage of Xtract are given in Table 1: the bigrams all contain the word "takeover" and an adjective. In the table, the distance parameter indicates the usual distance between the two words. For example, distance = 1 indicates that the two words are frequently adjacent in the corpus.
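The two first-stage criteria described above (a frequency threshold and relatively rigid usage) can be illustrated with a minimal Python sketch. It is not Xtract's actual statistical test: the function name `retrieve_bigrams`, the window size, the cutoff values, and the use of the share of the most frequent distance as a stand-in for "rigidity" are all assumptions made for illustration. Distance is taken here as the position of the target word minus the position of its collocate, so that distance = 1 means the collocate immediately precedes the target.

```python
from collections import Counter, defaultdict


def retrieve_bigrams(sentences, target, window=5, min_freq=3, min_peak_share=0.6):
    """Return {collocate: (frequency, usual_distance)} for words near `target`.

    Illustrative assumptions (not taken from the paper): the window size,
    both cutoffs, and "rigidity" measured as the share of co-occurrences
    that fall at the single most frequent distance.
    """
    # distance_counts[word][d] = number of times `word` occurs d positions
    # before the target (negative d: after the target), within one sentence.
    distance_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    distance_counts[tokens[j]][i - j] += 1

    bigrams = {}
    for word, counts in distance_counts.items():
        frequency = sum(counts.values())
        usual_distance, peak = counts.most_common(1)[0]
        # Keep the pair only if it is frequent enough and its distance
        # distribution is concentrated on one position.
        if frequency >= min_freq and peak / frequency >= min_peak_share:
            bigrams[word] = (frequency, usual_distance)
    return bigrams


toy_corpus = [
    "a hostile takeover bid",
    "the hostile takeover failed",
    "hostile takeover talk",
    "the takeover was hostile",
]
print(retrieve_bigrams(toy_corpus, "takeover"))
```

On this toy corpus only "hostile" passes both tests: it co-occurs with "takeover" four times, three of them at distance 1.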
In a second stage, Xtract uses the output bigrams to produce collocations involving more than two words (or n-grams). It examines all the sentences containing the bigram and analyzes the statistical distribution of words and parts of speech for each position around the pair. It retains words (or parts of speech) occupying a position with probability greater than a given
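The second-stage idea (examining each position around a bigram and keeping the words that dominate a position) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Xtract's implementation: it looks at words only and ignores parts of speech, and the function name `expand_bigram`, the `span` parameter, and the cutoff `min_prob` with its default value are hypothetical.

```python
from collections import Counter, defaultdict


def expand_bigram(sentences, w1, w2, distance=1, span=2, min_prob=0.7):
    """Return {relative_position: word} for positions dominated by one word.

    Positions are relative to w1 (position 0); w2 sits at `distance`,
    which must be a positive integer. `span` and `min_prob` are
    illustrative assumptions, not values from the paper.
    """
    position_counts = defaultdict(Counter)
    n_contexts = 0
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - distance):
            if tokens[i] != w1 or tokens[i + distance] != w2:
                continue
            n_contexts += 1
            # Record the word found at each position around the pair.
            for rel in range(-span, distance + span + 1):
                j = i + rel
                if 0 <= j < len(tokens):
                    position_counts[rel][tokens[j]] += 1

    retained = {}
    for rel, counts in position_counts.items():
        word, count = counts.most_common(1)[0]
        # Keep a word only if it fills this position in more than
        # `min_prob` of the sentences containing the bigram.
        if count / n_contexts > min_prob:
            retained[rel] = word
    return dict(sorted(retained.items()))


toy_contexts = [
    "the hostile takeover bid failed",
    "a hostile takeover bid emerged",
    "their hostile takeover bid collapsed",
    "one hostile takeover attempt failed",
]
print(expand_bigram(toy_contexts, "hostile", "takeover"))
```

On this toy input the pair "hostile takeover" grows into the 3-gram "hostile takeover bid", because "bid" fills the position after "takeover" in three of the four sentences.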
Similar resources
Using Synonym Relations in Chinese Collocation Extraction
A challenging task in Chinese collocation extraction is to improve both the precision and the recall rate. Most lexical statistical methods, including Xtract, are unable to extract collocations with frequencies lower than a given threshold. This paper presents a method where HowNet is used to find synonyms using a similarity function. Based on such synonym information, we have success...
INFO256 Project Report: Implementation and Evaluation of Xtract in WordSeer
Natural languages are full of word collocations that frequently co-occur and correspond to arbitrary word usages. They appear in both technical and non-technical textual corpora and often have specific significance in individual contexts. Accurately retrieving and identifying collocations from a given corpus in an unsupervised manner is imperative to understanding and automatically generating t...
Retrieving Collocations from Text: Xtract
Natural languages are full of collocations, recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages. Recent work in lexicography indicates that collocations are pervasive in English; apparently, they are common in all types of writing, including both technical and nontechnical genres. Several approaches have been proposed to ...
Automatically Extracting and Representing Collocations for Language Generation
Collocational knowledge is necessary for language generation. The problem is that collocations come in a large variety of forms. They can involve two, three or more words, these words can be of different syntactic categories and they can be involved in more or less rigid ways. This leads to two main difficulties: collocational knowledge has to be acquired and it must be represented flexibly so ...
The Identification and Classification of Unknown Words in Chinese: An N-Grams-Based Approach
In this paper, we propose a new approach to identifying unknown words in Chinese. This approach adopts an n-grams program to sort out the collocating word/character sequences that are possible words and phrases in Chinese. In addition to proposing the criteria for identifying Chinese new words, we also classify these new words according to their structural and semantic characteristics. The cor...